87 research outputs found

    CAFTAN: a tool for fast mapping, and quality assessment of cDNAs

    Background: The German cDNA Consortium has been cloning full-length cDNAs and exploiting them in protein localization experiments and cellular assays. However, the efficient use of large cDNA resources requires strategies capable of speedily selecting truly useful cDNAs from biological and experimental noise. To this end we have developed a new high-throughput analysis tool, CAFTAN, which simplifies these efforts and thus fills the gap between large-scale cDNA collections and their systematic annotation and application in functional genomics. Results: CAFTAN is built around the mapping of cDNAs to the genome assembly and the subsequent analysis of their genomic context. It uses sequence features such as the presence and type of PolyA signals, inner and flanking repeats, the GC-content, splice site types, etc. All these features are evaluated in individual tests, which classify cDNAs according to their sequence quality and their likelihood of having been generated from fully processed mRNAs. Additionally, CAFTAN compares the coordinates of mapped cDNAs with the genomic coordinates of reference sets from publicly available resources (e.g., VEGA, ENSEMBL). This provides detailed information about overlapping exons and the structural classification of cDNAs with respect to the reference set of splice variants. The evaluation of CAFTAN showed that it is able to correctly classify more than 85% of 5950 selected "known protein-coding" VEGA cDNAs as high-quality multi- or single-exon. It identified as good 80.6% of the single-exon cDNAs and 85% of the multiple-exon cDNAs. The program is written in Perl and in a modular way, allowing this strategy to be adapted to other tasks such as EST annotation, or extended by adding new classification rules and new organism databases as they become available. We think that it is a very useful program for the annotation and research of unfinished genomes.
Conclusion: CAFTAN is a high-throughput sequence analysis tool that performs a fast and reliable quality prediction of cDNAs. Several thousand cDNAs can be analyzed in a short time, giving the curator/scientist a first quick overview of the quality and the existing annotation of a set of cDNAs. It supports the rejection of low-quality cDNAs and helps in the selection of likely novel splice variants and/or completely novel transcripts for new experiments. Funding: German Federal Ministry of Education and Research 01GR0101 and 01GR0420 and 01GR045
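
The abstract above names GC-content and PolyA signals among the sequence features that are tested. As a rough illustration of what two such per-sequence tests might look like, here is a minimal Python sketch. It is not CAFTAN's Perl implementation: the function names, the 50-nt search window, and the choice of the two hexamers (the canonical AATAAA and the common variant ATTAAA) are assumptions made for this example only.

```python
from typing import Optional, Tuple

# Assumed signal set for illustration: canonical hexamer plus one common variant.
POLYA_SIGNALS = ("AATAAA", "ATTAAA")


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_polya_signal(seq: str, window: int = 50) -> Optional[Tuple[str, int]]:
    """Look for a PolyA signal hexamer in the last `window` bases of seq.

    Returns (signal, 0-based position in seq) for the hit closest to the
    3' end, or None if no signal from POLYA_SIGNALS is present.
    """
    tail = seq.upper()[-window:]
    best = None
    for signal in POLYA_SIGNALS:
        pos = tail.rfind(signal)
        if pos != -1 and (best is None or pos > best[1]):
            best = (signal, pos)
    if best is None:
        return None
    # Convert the position within the tail back to a position within seq.
    return best[0], len(seq) - len(tail) + best[1]
```

A call such as `find_polya_signal(cdna_sequence)` could feed one of several independent tests whose results are then combined into a quality class; how CAFTAN actually weights or combines its tests is not stated in the abstract.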

    Rhomboid Protease Dynamics and Lipid Interactions

    Intramembrane proteases, which cleave transmembrane (TM) helices, participate in numerous biological processes encompassing all branches of life. Several crystallographic structures of Escherichia coli GlpG rhomboid protease have been determined. In order to understand GlpG dynamics and lipid interactions in a native-like environment, we have examined the molecular dynamics of wild-type and mutant GlpG in different membrane environments. The irregular shape and small hydrophobic thickness of the protein cause significant bilayer deformations that may be important for substrate entry into the active site. Hydrogen-bond interactions with lipids are paramount in protein orientation and dynamics. Mutations in the unusual L1 loop cause changes in protein dynamics and protein orientation that are relayed to the His-Ser catalytic dyad. Similarly, mutations in TM5 change the dynamics and structure of the L1 loop. These results imply that the L1 loop has an important regulatory role in proteolysis. Funding: National Institute of General Medical Sciences (GM-74637

    cDNA2Genome: A tool for mapping and annotating cDNAs

    BACKGROUND: In recent years, several high-throughput cDNA sequencing projects have been funded worldwide with the aim of identifying and characterizing the structure of complete novel human transcripts. However, some of these cDNAs are error-prone due to frameshifts and stop codon errors caused by low sequence quality, or to cloning of truncated inserts, among other reasons. Therefore, accurate CDS prediction from these sequences first requires the identification of potentially problematic cDNAs in order to speed up the subsequent annotation process. RESULTS: cDNA2Genome is an application for the automatic high-throughput mapping and characterization of cDNAs. It utilizes current annotation data and the most up-to-date databases, especially in the case of ESTs and mRNAs, in conjunction with a vast number of approaches to gene prediction in order to perform a comprehensive assessment of the cDNA exon-intron structure. The final result of cDNA2Genome is an XML file containing all relevant information obtained in the process. This XML output can easily be used for further analysis such as program pipelines, or the integration of results into databases. The web interface to cDNA2Genome also presents this data in HTML, where the annotation is additionally shown in a graphical form. cDNA2Genome has been implemented under the W3H task framework, which allows the combination of bioinformatics tools in tailor-made analysis task flows as well as the sequential or parallel computation of many sequences for large-scale analysis. CONCLUSIONS: cDNA2Genome represents a new versatile and easily extensible approach to the automated mapping and annotation of human cDNAs. The underlying approach allows sequential or parallel computation of sequences for high-throughput analysis of cDNAs.

    Profile analysis and prediction of tissue-specific CpG island methylation classes

    Background: The computational prediction of DNA methylation has become an important topic in the recent years due to its role in the epigenetic control of normal and cancer-related processes. While previous prediction approaches focused merely on differences between methylated and unmethylated DNA sequences, recent experimental results have shown the presence of much more complex patterns of methylation across tissues and time in the human genome. These patterns are only partially described by a binary model of DNA methylation. In this work we propose a novel approach, based on profile analysis of tissue-specific methylation that uncovers significant differences in the sequences of CpG islands (CGIs) that predispose them to a tissue-specific methylation pattern. Results: We defined CGI methylation profiles that separate not only between constitutively methylated and unmethylated CGIs, but also identify CGIs showing a differential degree of methylation across tissues and cell-types or a lack of methylation exclusively in sperm. These profiles are clearly distinguished by a number of CGI attributes including their evolutionary conservation, their significance, as well as the evolutionary evidence of prior methylation. Additionally, we assess profile functionality with respect to the different compartments of protein coding genes and their possible use in the prediction of DNA methylation. Conclusion: Our approach provides new insights into the biological features that determine if a CGI has a functional role in the epigenetic control of gene expression and the features associated with CGI methylation susceptibility. Moreover, we show that the ability to predict CGI methylation is based primarily on the quality of the biological information used and the relationships uncovered between different sources of knowledge.
The strategy presented here is able to predict, besides the constitutively methylated and unmethylated classes, two more tissue-specific methylation classes conserving the accuracy provided by leading binary methylation classification methods.

    Profile analysis and prediction of tissue-specific CpG island methylation classes

    Background: The computational prediction of DNA methylation has become an important topic in the recent years due to its role in the epigenetic control of normal and cancer-related processes. While previous prediction approaches focused merely on differences between methylated and unmethylated DNA sequences, recent experimental results have shown the presence of much more complex patterns of methylation across tissues and time in the human genome. These patterns are only partially described by a binary model of DNA methylation. In this work we propose a novel approach, based on profile analysis of tissue-specific methylation that uncovers significant differences in the sequences of CpG islands (CGIs) that predispose them to a tissue-specific methylation pattern. Results: We defined CGI methylation profiles that separate not only between constitutively methylated and unmethylated CGIs, but also identify CGIs showing a differential degree of methylation across tissues and cell-types or a lack of methylation exclusively in sperm. These profiles are clearly distinguished by a number of CGI attributes including their evolutionary conservation, their significance, as well as the evolutionary evidence of prior methylation. Additionally, we assess profile functionality with respect to the different compartments of protein coding genes and their possible use in the prediction of DNA methylation. Conclusion: Our approach provides new insights into the biological features that determine if a CGI has a functional role in the epigenetic control of gene expression and the features associated with CGI methylation susceptibility. Moreover, we show that the ability to predict CGI methylation is based primarily on the quality of the biological information used and the relationships uncovered between different sources of knowledge.
The strategy presented here is able to predict, besides the constitutively methylated and unmethylated classes, two more tissue-specific methylation classes conserving the accuracy provided by leading binary methylation classification methods.

    Cis-cop: Multiobjective identification of cis-regulatory modules based on constrains

    Gene expression regulation is an intricate, dynamic phenomenon essential for all biological functions. The necessary instructions for gene expression are encoded in cis-regulatory elements that work together and interact with the RNA polymerase to confer specific spatial and temporal patterns of transcription. Therefore, the identification of these elements is currently an active area of research in computational analysis of regulatory sequences. However, the problem is difficult since the combinatorial interactions between the regulating factors can be very complex. Here we present a web server, Cis-cop, that identifies cis-regulatory modules given a set of transcription factor binding sites and, additionally, also RNA pol sites for a group of genes

    Optimization of multi-classifiers for computational biology: application to gene finding and expression

    Genomes of many organisms have been sequenced over the last few years. However, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed to address part of this problem: the location of genes along a genome and their expression. We propose a multi-objective methodology to combine state-of-the-art algorithms into an aggregation scheme in order to obtain optimal methods’ aggregations. The results obtained show a major improvement in sensitivity when our methodology is compared to the performance of individual methods for gene finding and gene expression problems. The methodology proposed here is an automatic method generator, and a step forward to exploit all already existing methods, by providing alternative optimal methods’ aggregations to answer concrete queries for a certain biological problem with a maximized accuracy of the prediction. As more approaches are integrated for each of the presented problems, de novo accuracy can be expected to improve further. Funding: Ministerio de Ciencia y Tecnología TIN2006-12879; Junta de Andalucía TIC-0278

    Uncovering the complex genetic architecture of human plasma lipidome using machine learning methods

    The genetic architecture of the plasma lipidome provides insights into the regulation of lipid metabolism and related diseases. We applied an unsupervised machine learning method, PGMRA, to discover many-to-many relations between genotype and plasma lipidome (phenotype) in order to identify the genetic architecture of the plasma lipidome profiled from 1,426 Finnish individuals aged 30–45 years. PGMRA involves biclustering genotype and lipidome data independently, followed by their inter-domain integration based on hypergeometric tests of the number of shared individuals. Pathway enrichment analysis was performed on the SNP sets to identify their associated biological processes. We identified 93 statistically significant (hypergeometric p-value < 0.01) lipidome-genotype relations. Genotype biclusters in these 93 relations contained 5977 SNPs across 3164 genes. Twenty-nine of the 93 relations contained genotype biclusters with more than 50% unique SNPs and participants, thus representing the most distinct subgroups. We identified 30 significantly enriched biological processes among the SNPs involved in 21 of these 29 most distinct genotype-lipidome subgroups, through which the identified genetic variants can influence and regulate plasma lipid related metabolism and profiles.
This study identified 29 distinct genotype-lipidome subgroups in the studied Finnish population that may have distinct disease trajectories and therefore could be useful in precision medicine research. Funding: Research Council of Finland; Social Insurance Institution of Finland; Competitive State Research Financing of Expert Responsibility area of Kuopio, Tampere and Turku University Hospitals; Juho Vainio Foundation; Paavo Nurmi Foundation; Finnish Foundation for Cardiovascular Research; Finnish Cultural Foundation; Finnish IT center for science; Sigrid Juselius Foundation; Tampere Tuberculosis Foundation; Emil Aaltonen Foundation; Yrjo Jahnsson Foundation; Signe and Ane Gyllenberg Foundation; Diabetes Research Foundation of Finnish Diabetes Association 322098 286284 134309 126925 121584 124282 255381 256474 283115 319060 320297 314389 338395 330809 104821 129378 117797 141071 INFRAIA-2016-1-730897; Horizon 2020; European Research Council (ERC) European Commission 349708; Tampere University Hospital Supporting Foundation; Finnish Society of Clinical Chemistry; Spanish Government RTI2018-098983-B-100; Laboratoriolaaketieteen Edistamissaatio Sr; Ida Montinin saatio; Kalle Kaiharin saatio; Aarne Koskelon saatio; Faculty of Medicine and Health Technology, Tampere University; Project HPC-EUROPA3 X51001 50191928; EC Research Innovation Action under H2020 Programme 75532
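
The abstract above links a genotype bicluster to a lipidome bicluster through a hypergeometric test on the number of individuals they share. The following Python sketch shows one way such an overlap p-value can be computed with the standard library only. It is an illustration under stated assumptions, not the PGMRA implementation: the function name and the bicluster sizes in the usage line are hypothetical, and only the cohort size (1,426) and the 0.01 threshold come from the abstract.

```python
from math import comb


def overlap_p_value(shared: int, population: int, size_a: int, size_b: int) -> float:
    """Upper-tail hypergeometric probability P(X >= shared).

    X is the number of individuals that a random set of size_b drawn from
    `population` individuals would share with a fixed set of size_a.
    """
    lower = max(shared, 0)
    upper = min(size_a, size_b)
    total = comb(population, size_b)
    # math.comb(n, k) returns 0 when k > n, so impossible overlaps add nothing.
    favourable = sum(
        comb(size_a, i) * comb(population - size_a, size_b - i)
        for i in range(lower, upper + 1)
    )
    return favourable / total


# Hypothetical biclusters: 100 and 80 individuals out of 1,426, sharing 20.
p = overlap_p_value(shared=20, population=1426, size_a=100, size_b=80)
is_significant = p < 0.01
```

With these made-up sizes the overlap expected by chance is about 100 × 80 / 1426 ≈ 5.6 individuals, so sharing 20 is far into the upper tail. Exact integer arithmetic via `math.comb` avoids floating-point underflow for a cohort of this size.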

    Optimization of multi-classifiers for computational biology: application to gene finding and expression

    Genomes of many organisms have been sequenced over the last few years. However, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed to address part of this problem: the location of genes along a genome and their expression. We propose a multi-objective methodology to combine state-of-the-art algorithms into an aggregation scheme in order to obtain optimal methods’ aggregations. The results obtained show a major improvement in sensitivity when our methodology is compared to the performance of individual methods for gene finding and gene expression problems. The methodology proposed here is an automatic method generator, and a step forward to exploit all already existing methods, by providing alternative optimal methods’ aggregations to answer concrete queries for a certain biological problem with a maximized accuracy of the prediction. As more approaches are integrated for each of the presented problems, de novo accuracy can be expected to improve further. Funding: Ministry of Science and Innovation, Spain (MICINN) Spanish Government TIN-2006-12879; Junta de Andalucia TIC-02788; Howard Hughes Medical Institute; European Commission; Junta de Andaluci

    Identification of differentially expressed small non-coding RNAs in the legume endosymbiont Sinorhizobium meliloti by comparative genomics

    Bacterial small non-coding RNAs (sRNAs) are being recognized as novel widespread regulators of gene expression in response to environmental signals. Here, we present the first search for sRNA-encoding genes in the nitrogen-fixing endosymbiont Sinorhizobium meliloti, performed by a genome-wide computational analysis of its intergenic regions. Comparative sequence data from eight related alpha-proteobacteria were obtained, and the interspecies pairwise alignments were scored with the programs eQRNA and RNAz as complementary predictive tools to identify conserved and stable secondary structures corresponding to putative non-coding RNAs. Northern experiments confirmed that eight of the predicted loci, selected among the original 32 candidates as most probable sRNA genes, expressed small transcripts. This result supports the combined use of eQRNA and RNAz as a robust strategy to identify novel sRNAs in bacteria. Furthermore, seven of the transcripts accumulated differentially in free-living and symbiotic conditions. Experimental mapping of the 5'-ends of the detected transcripts revealed that their encoding genes are organized in autonomous transcription units with recognizable promoter and, in most cases, termination signatures. These findings suggest novel regulatory functions for sRNAs related to the interactions of alpha-proteobacteria with their eukaryotic hosts. Funding: Spanish Ministerio de Educación y Ciencia (Project AGL2006-12466/AGR); Junta de Andalucía (Project CV1-01522); NIH Grant 1R01GM070538-02; FPI Fellowship from the Spanish Ministerio de Educación y Cienci